A METHOD OF MANAGING WEB SITES REGISTERED IN SEARCH
ENGINE AND A SYSTEM THEREOF 



Technical Field 

The present invention relates to a search engine for providing information
about web sites on the Internet, and more particularly to a method for managing web 
sites registered in a search engine, wherein information about the web sites registered 
in the search engine is analyzed to prevent the provision of search results different 
from essential contents contained in the web sites. 

Background Art

A conventional search engine, such as Altavista (http://www.altavista.com), 
Lycos (http://www.lycos.com) or Yahoo (http://www.yahoo.com), generally includes a 
database for classifying, storing and managing web site information based on a 
predetermined rule, a search robot, embodied as software, for constantly traveling over
the web and automatically collecting new web site information, and search engine
software for storing the collected data in a database and allowing a user of the search 
engine to search for desired information in the database. 

Fig. 1a is a block diagram showing an entire system for providing the search
engine service. As shown in Fig. 1a, a user connects to a search engine server 150
over the Internet via a user terminal 110. If the user enters search terms, the search
engine server 150 queries search engine software 140 about web site information
corresponding to the entered search terms, and the search engine software 140 searches
a database 130 and notifies the user of the retrieved web site information. A search
robot 120 is an entity embodied as software for constantly traveling over the web and
automatically collecting new web site information from a web server 160, as described
above. The search robot 120 searches for HTML (Hypertext Markup Language)
documents on a network, parses the links described in the HTML documents, and then
collects data from a number of web sites existing on the network. The data collected
by the search robot 120 is databased. The term "databased" refers to a series of
processes of performing morphological analysis of information located on a web site,
producing a corresponding index table, and storing it in the database 130. The
database 130 is provided to store all web site information collected by the search robot
120. The search engine software 140 functions to show search results to users. This
software searches a large number of pages stored in the database 130 and lists search
results by relevance to the search term. The conventional search engine as described
above registers information about a web site in a search engine and provides the
information to users in the following ways.

(1) Information of a web site is collected using the search robot as described 
above, and the web site information is registered in the search engine after being
reviewed by expert surfers.

(2) A category corresponding to the subject of a web site to be registered is 
selected from a directory of categories classified by subject, and it is requested that the 
web site be registered in the selected category, and then the web site is registered in the 
search engine after being reviewed by expert surfers. Some search engines provide a
fee-based directory registration service to reduce the time required to register a web site
in their directory with a registration fee. 

Web sites registered in the search engine by the above methods are provided to a
user who is looking for desired information after they are searched for in various ways,
such as integrated web search and directory search, based on search terms entered by
the user. The integrated web search is also called "word-based search", in which
Uniform Resource Locators (URLs) of all web sites are stored in a database and
desired information is searched for based on a specific keyword entered by the user.
The directory search is also called "subject-based search", in which web sites are
organized into subject-based categories, and if a user links to a desired category, the user
can view detailed items thereof. In this manner, the subject-based search allows the
user to continue to link to the detailed items and retrieve desired information. For
example, if a user desires to find Korean team match scores in the 2002 Korea-Japan
World Cup, the user can search for them via categories such as Sports -> Ball Sports
-> Soccer -> FIFA World Cup -> 2002 Korea-Japan World Cup -> Korean team match
scores. Fig. 1b is an example screenshot of the directory search method. As shown
in this figure, the directory search results for the search terms "world cup" are three
categories, "World Cup", "2002 FIFA Korea-Japan World Cup" and "History of the World Cup",
and the user can search for desired information by moving to one of the three categories
in which the desired information is most likely to be placed. A typical search engine
based on the integrated web search method is Lycos (http://lycos.cs.cmu.edu) developed
by Michael L. Mauldin at Carnegie-Mellon University, and a typical search engine
based on the directory search method is Yahoo (http://www.yahoo.com). Many
current search engines provide hybrid search services based on a combination of the
different search methods described above.

The conventional method for registering web sites in the search engine and 
searching for the registered web sites has the following problems. 

As the number of Internet users has rapidly increased, the number of users who 
desire to search for specific information has rapidly increased and the number of types 
of information for which they desire to search has increased. As the number of such 
users and the types of such information have increased, some search terms appear very
frequently; these will also be referred to as "popular keywords". This causes a
problem in that users who search for information based on the popular
keywords may receive information of web sites (hereinafter also referred to as
"deceptive sites") that contain content of no use to the users but insert the popular
keywords in their web pages in various ways. For example, if a user enters the popular
keyword "Pikachu" to search for information about Pikachu, information of all
registered web sites that contain the word "Pikachu" in their web pages is provided to 
the user. The web sites provided to the user may include web sites that contain adult 
or sexual contents and insert the word "Pikachu" in some places in their web pages in 
various ways (with ill intention in most cases). This popular keyword insertion causes 
a wide age range of users to be exposed to the information of the web sites that contain 
adult or sexual contents. 
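
The effect of keyword insertion can be illustrated with a minimal sketch. It assumes, purely for illustration, a ranking that simply counts occurrences of the search term in the page text; the page contents and the function name `keyword_frequency` are hypothetical and are not taken from any actual search engine.

```python
import re


def keyword_frequency(page_text: str, keyword: str) -> int:
    """Count case-insensitive whole-word occurrences of keyword in page_text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


# Hypothetical pages: one genuinely about the keyword, one merely stuffed with it.
genuine_page = "Pikachu is an electric-type character. Fans collect Pikachu cards."
stuffed_page = "Adult content here. " + "Pikachu " * 20

pages = {"genuine": genuine_page, "stuffed": stuffed_page}

# A naive ranking by raw keyword frequency places the stuffed page first.
ranking = sorted(
    pages,
    key=lambda name: keyword_frequency(pages[name], "Pikachu"),
    reverse=True,
)
```

Under this simplified assumption the stuffed page outranks the genuine one, which is the behavior the deceptive sites exploit.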

The conventional method for overcoming the problems described above 
requires complaint reports by users or requires specialists such as expert surfers to 
constantly monitor the registered web sites, but the conventional method obviously 
cannot be an ultimate solution to the problems. If an algorithm automatically executed 
on the Internet to solve the problems can be provided, it will be a useful means to solve 
the problems all at once.

Disclosure of the Invention

Therefore, the present invention has been made in view of the above 
problems, and it is an object of the present invention to provide a method for 
managing web sites registered in a search engine, in which an algorithm is used to 
automatically detect deceptive sites, thereby allowing users of the search engine to
correctly search for their desired information. 

It is another object of the present invention to provide a method for managing 
web sites registered in a search engine, in which deceptive sites are automatically 
detected, and punitive measures are automatically imposed on operators of the
detected deceptive sites, thereby reinforcing self-purification of the web sites
registered in the search engine. 

It is yet another object of the present invention to provide a method for 
managing web sites registered in a search engine, in which an algorithm is used to 
automatically detect deceptive sites and automatically take punitive measures such as
warning against the detected sites, thereby saving a large amount of human resources
that may otherwise have been wasted to detect the deceptive sites. 

According to a preferred embodiment of the present invention, there is provided a
method for managing web sites registered in a search engine, said method comprising
the steps of: receiving web site information of a registered web site, classifying the
web site information by predetermined fields, and recording the classified web site
information in a database; reading a source file constituting a web page of the
registered web site; analyzing the read source file; determining, based on a
predetermined basis, whether or not the registered web site is a deceptive site; and
performing a control operation to perform predetermined processing on the registered
web site if the web site is determined to be a deceptive site, wherein the source file is
an HTML (Hypertext Markup Language) document.
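
The sequence of steps recited above can be sketched as a single function. This is only an illustrative outline under stated assumptions: the function name `manage_registered_site`, the dictionary keys, and the three callables passed in are hypothetical stand-ins, and the specification does not prescribe any particular implementation.

```python
from typing import Callable, Dict, List


def manage_registered_site(
    site_info: Dict[str, str],
    database: List[Dict[str, str]],
    read_source: Callable[[str], str],
    looks_deceptive: Callable[[str], bool],
    handle_deceptive: Callable[[Dict[str, str]], None],
) -> bool:
    """Run the recited steps once for a single registered site."""
    database.append(site_info)               # record the classified information
    source = read_source(site_info["url"])   # read the page's HTML source file
    deceptive = looks_deceptive(source)      # analyze the source and decide
    if deceptive:
        handle_deceptive(site_info)          # perform the predetermined processing
    return deceptive


# Usage with stand-in callables (all hypothetical):
db: List[Dict[str, str]] = []
warned: List[str] = []
result = manage_registered_site(
    {"url": "http://example.com", "email": "owner@example.com"},
    db,
    read_source=lambda url: '<div style="display:none">Starcraft</div>',
    looks_deceptive=lambda html: "display:none" in html,
    handle_deceptive=lambda info: warned.append(info["email"]),
)
```

Passing the reading, analysis, and processing steps in as callables keeps the outline independent of how each step is actually realized.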

In addition, according to a preferred embodiment of the present invention, there is
provided a system for managing a web site registered in a search engine, the system
comprising: an interface module for performing data communication with at least one
terminal; a web site registration module for receiving a web site registration request
including web site information of a predetermined web site from said at least one
terminal and classifying the web site information by predetermined fields; a database
for classifying and storing a predetermined keyword corresponding to the web site and
the web site information; a web site analysis module for extracting a source file
constituting a web page of the web site, and analyzing the extracted source file; and a
web site management module for determining, based on a predetermined basis,
whether or not the web site is a deceptive site.

As described above, the term "deceptive site" used in the present specification 
refers to a web site that inserts predetermined keywords in a source file of its web 
page in various ways and contains contents entirely different from those to be 
searched for based on the predetermined keywords. According to an embodiment of
the present invention, the predetermined keywords inserted in the source file of the
web page may be popular keywords. 

The term "popular keywords" refers to search words that appear very
frequently among search words entered by Internet users. The popular keywords
may continually vary depending on the Internet users' tendencies and the social situation
of the time. The popular keywords may include harmful keywords containing
socially harmful content, and some examples thereof are "suicide", "reject",
"gambling" and "conspiracy".

Brief Description of the Drawings 

The above and other objects, features and other advantages of the present 
invention will be more clearly understood from the following detailed description
taken in conjunction with the accompanying drawings, in which: 

Fig. 1a is a block diagram showing the configuration of a conventional
system for providing web site search engine services;

Fig. 1b is an example screenshot of a directory search method that is one of the
web site search methods provided by search engines;

Fig. 2 is a block diagram showing the configuration of a system for managing 
web sites registered in a search engine according to a preferred embodiment of the 
present invention; 

Fig. 3 is a flow chart showing a method for managing web sites registered in 
a search engine according to an embodiment of the present invention;

Figs. 4a to 4k show various types of deceptive sites read by a search robot that
travels over the web, in the method for managing web sites registered in the search
engine according to a preferred embodiment of the present invention;

Fig. 5 is a flow chart showing a method for imposing a predetermined punitive
measure on a registrant of a web site that is determined to be a deceptive site, in the
method for managing the web sites registered in the search engine, according to a
preferred embodiment of the present invention; and

Fig. 6 is a block diagram showing the internal configuration of a general 
computer system that can be used in managing web pages registered in the search 
engine according to the present invention. 

Best Mode for Carrying Out the Invention

A method for managing web sites registered in a search engine according to 
preferred embodiments of the present invention will now be described in detail with 
reference to the accompanying drawings. 

Fig. 2 is a block diagram showing the configuration of a system for managing
web sites registered in a search engine according to an embodiment of the present
invention. As shown in Fig. 2, the system according to the embodiment of the present
invention includes an interface module 201, a web site registration module 202, a web
site management module 203, a web site information database 204, a web site analysis
module 205 and a search robot 207. According to the embodiment of the present
invention, the system for managing web sites registered in the search engine may
include a mail server 208 or an SMS server 209 for sending a predetermined message to
a registrant of a registered web site. The mail server 208 and the SMS server 209 may
be provided in a system for providing search engine services or may be located in a
system operated by a third party. The interface module 201, other various modules,
and the mail server 208 or the SMS server 209 are illustrated in Fig. 2 as separate
entities. This illustration has been made only for easier explanation, and they may be
the same entity. The elements shown in Fig. 2 may also be physically located at the
same place, or alternatively they may be physically located apart from each other
according to another embodiment of the present invention.

First, the interface module 201 functions to support data transmission between
the search engine registration management system and a computer terminal provided to
a registrant who desires to register a predetermined web site in the search engine, and
also functions to interface between physical transmission equipment.

The web site registration module 202 functions to receive a request to register
the predetermined web site from the registrant, and also to collect and classify
information/data about the web site contained in the web site registration request. The
web site registration module 202 may further include a billing module (not shown) for
charging predetermined fees for the web site registration. The billing module may
operate to charge different fees for a web site desired to be registered, depending on the
type of the web site (i.e., depending on whether it is a general site containing general
content or an adult site containing adult content).

The web site management module 203 is a module for overall registration
management of web sites according to the present invention. Based on information of
the web sites collected by the search robot 207, the web site management module 203
determines whether the web sites are operating in conformity with the standard based
on which their registration was permitted. If it is determined that a web site is
operating inappropriately (i.e., it is a deceptive site), the web site management module
203 automatically takes a predetermined measure against the registrant of the web site.
The web site management module 203 can interwork with the mail server 208 or the
SMS server 209 to send an email to the registrant of the deceptive site or to send an
SMS message to a mobile terminal of the registrant, thereby warning the
registrant about the inappropriate operation of the deceptive site.

The web site information database 204 functions to classify and record 
information of the registered web sites. Various information, such as URLs, 
keywords, registrant information (registrant's name, address, email address, mobile
terminal number, etc.), directory information, and the like of the web sites, may be
classified by the information fields and stored in the web site information database 204. 
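
As a rough illustration of classification by information fields, the sketch below maps a raw registration form onto a record with the fields listed above. The class name `RegisteredSite`, the function `classify_request`, and the form keys are assumptions made for this example only; the specification does not define a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegisteredSite:
    """One record of the web site information database (illustrative fields)."""
    url: str
    keywords: List[str] = field(default_factory=list)
    directory: str = ""
    registrant_name: str = ""
    registrant_address: str = ""
    registrant_email: str = ""
    registrant_mobile: str = ""


def classify_request(form: Dict[str, str]) -> RegisteredSite:
    """Map a raw registration form (hypothetical keys) onto the record fields."""
    keywords = [k.strip() for k in form.get("keywords", "").split(",") if k.strip()]
    return RegisteredSite(
        url=form["url"],
        keywords=keywords,
        directory=form.get("directory", ""),
        registrant_name=form.get("name", ""),
        registrant_address=form.get("address", ""),
        registrant_email=form.get("email", ""),
        registrant_mobile=form.get("mobile", ""),
    )
```

Keeping the registrant's email address and mobile number as separate fields makes them directly available when a warning has to be sent by email or SMS.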

Information of a web site stored in the web site information database 204 may
be modified by a registrant of the web site and by a system manager. When content of
a web site is changed, the web site information database 204 may automatically update
information of the web site stored therein, based on analysis results (for example, based
on a new keyword corresponding to a URL of the web site) of data collected by the
search robot 207 even though a registrant of the web site does not directly modify the
stored information of the web site.

The web site analysis module 205 functions to analyze information of web sites 
collected by the search robot 207. The type of data collected by the search robot 207 
and a method for analyzing the collected data will be described below in detail with
reference to Fig. 3.

The above elements of the system for managing web sites registered in the 
search engine according to the embodiment of the present invention are divided simply 
according to their functions for easier explanation, and the functional division of the 
elements has nothing to do with actual physical locations thereof. It is obvious to
those skilled in the art that the above modules may be embodied not only as hardware
but also as software using a specific code. 

Fig. 3 is a flow chart showing a method for managing web sites registered in a 
search engine according to a preferred embodiment of the present invention. The 
method for managing the web sites registered in the search engine according to the
preferred embodiment of the present invention will now be described in detail with
reference to Fig. 3 in conjunction with Figs. 4a to 4k and Fig. 6. 

The web site registration management method according to the preferred
embodiment of the present invention is performed in the following manner, as shown in
Fig. 3. A registrant, who desires to register a predetermined web site in the search
engine, makes a request to register the web site with information of the web site (305).
The information of the web site is classified by information fields (registrant's name,
address, email address, mobile phone number, etc.) and recorded in a web site
information database (310), and the web site is registered in the search engine (315).
This registration step 315 may be performed in several ways. For example, in one
way, a web site is registered in the search engine upon request of a manager of the web
site as described above. In another way, a web site is registered in the search engine
based on information of the web site obtained by the search robot that randomly travels
over the web. In the former case, the registrant (i.e., the manager) of the web site can
request that the web site be registered in a category closest to a subject (for example,
"Pikachu" and "patent bar exam") thereof decided by the registrant. After being
reviewed by expert surfers, the requested web site can be registered in the search engine
if it is determined that the requested web site satisfies predetermined requirements (for
example, quality of the web site or noncommercial site requirements in case no
registration fee is paid). The method for managing web sites registered in the search
engine according to the present invention will be described, limited to the case where
the web site is registered in the search engine upon request of the registrant of the web
site. However, the method and system for managing web sites registered in the search
engine according to the present invention can also be applied to other various ways in
which the web site is registered in the search engine.

If the web site is registered, the search engine controls the search robot to read 
a source file constituting a web page of the registered web site and analyze the read 
source file (320). 

According to the embodiment of the present invention, the source file analysis 
is based on HTML (Hypertext Markup Language) document analysis. In more detail, 
by analyzing tags in an HTML document of a web site, it can be determined whether the 
web site is a deceptive site that inserts popular keywords (i.e., high frequency search 
words) in an HTML document constituting its web site. As is well known to those
skilled in the art, the HTML document is composed of instructions called "tags", and a 
web designer or the like, who produces web pages, composes a web site using the tags, 
and includes content, which is desired to be provided via the web site, in the web site. 

Figs. 4a to 4k are diagrams illustrating various embodiments of a method for
analyzing an HTML document of a web site at step 320 of Fig. 3 to determine whether
the web site is a deceptive site that includes inappropriate character strings in tags
contained in its HTML document. These figures illustrate various ways to detect
whether a web site is a deceptive site, based on analysis of HTML document tags of the 
web site. A detailed description will now be given of how the HTML document 
analysis is performed in the method for managing web sites registered in the search 
engine according to the present invention, with reference to Figs. 4a to 4k. 

(1) DECEPTIVE SITE USING STRING OF THE SAME COLOR AS 
BACKGROUND COLOR 

Fig. 4a is an example deceptive site that contains character strings, enclosed by
tags, that are the same color as the background color of the deceptive site. In this
figure, the left images are screenshots of web sites displayed to users, and the right
images are HTML source files of the web sites displayed on the left side. As shown in
Fig. 4a, "#FFFFFF" is assigned to background color and "#FFFFFF" is also assigned to
text color in the upper source file, so that the text "Starcraft" and "Zolaman" is not
visible in the upper web site screen. In the same manner, "#FFFFFF" indicating white is
assigned to background color and "white" is also assigned to text color in the lower
source file of Fig. 4a, so that the text "Starcraft" and "Zolaman" is not visible in the
lower web site screen. As is well known to those skilled in the art, the tag <body> shown
in the source files of Fig. 4a allows setting of various attributes of text or background
displayed on a web page. Tags may be mainly classified into container tags composed
of start and end tags (for example, <body> </body> or <font> </font> shown in Fig. 4a)
and standalone tags that do not require end tags. These tags may be used to compose a
web site in various ways. Accordingly, if the background color of a web site is the
same as the character string color thereof as described above, the web site can be
displayed on a search results screen with the help of predetermined popular keywords
even though it contains content unrelated to the popular keywords.
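
A check for this case might look like the following minimal sketch, which uses Python's standard `html.parser` module. It assumes the colors are set through the `bgcolor` and `text` attributes of the body tag and the `color` attribute of the font tag, as in the example of Fig. 4a; colors set through style sheets are not handled, and the names `SameColorTextFinder` and `find_same_color_text` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Deliberately tiny table of color names; a real checker would need the full list.
NAMED_COLORS = {"white": "#ffffff", "black": "#000000"}


def normalize_color(value: Optional[str]) -> str:
    """Lower-case a color value and map known names to hex notation."""
    value = (value or "").strip().lower()
    return NAMED_COLORS.get(value, value)


class SameColorTextFinder(HTMLParser):
    """Collect text whose color equals the <body> background color."""

    def __init__(self) -> None:
        super().__init__()
        self.background = ""
        self.body_text_color = ""
        self.font_colors: List[str] = []
        self.hidden_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body":
            self.background = normalize_color(attributes.get("bgcolor"))
            self.body_text_color = normalize_color(attributes.get("text"))
        elif tag == "font":
            self.font_colors.append(normalize_color(attributes.get("color")))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.background:
            return
        # The innermost <font color> wins; otherwise fall back to <body text>.
        color = next((c for c in reversed(self.font_colors) if c), self.body_text_color)
        if color == self.background:
            self.hidden_text.append(text)


def find_same_color_text(html: str) -> List[str]:
    """Return the text fragments drawn in the page's background color."""
    finder = SameColorTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

Normalizing "white" and "#FFFFFF" to the same value lets the sketch cover both the upper and the lower source file of Fig. 4a.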

(2) DECEPTIVE SITE USING STRING CONTAINED IN REDIRECTION PAGE

Fig. 4b is a diagram showing an example deceptive site using character strings
contained in a redirection page. In this figure, the left image is a screenshot of a web
site displayed to users, and the right images are HTML source files of the web site
displayed on the left side. As is well known to those skilled in the art, the redirection
setting allows movement from a connected web site to a new web site, and it can be
embodied in source files as shown on the right side of Fig. 4b. In Fig. 4b, the upper
source file uses an http-equiv attribute in the meta tag. The meta tag is generally used
to set automatic redirection to a different web page within a predetermined time
specified in a "content" item of Fig. 4b. Typically, if the address of a home page is
changed, the meta tag is used to automatically redirect a user connecting to an old
address of the home page to a new address thereof within a predetermined time after
displaying the address change information. The middle and lower source files in Fig.
4b use "self.location" and "location.replace" script statements, respectively, to redirect
from the current web page to "http://www.naver.com".

In the example deceptive site shown in Fig. 4b that uses the redirection page, 
the upper source file including the meta tag inserts predetermined popular keywords
"Starcraft" and "Zolaman" next to the redirection instruction, and the middle and lower
source files insert the predetermined popular keywords "Starcraft" and "Zolaman" next
to the tag </script>. 

These redirection pages use the tags to instruct movement to different web sites
and thus text added next to the tags plays no role. However, the search robot provides
search results determined based on the frequency of occurrence of a specific character
string in a web site, which may cause the subject of the web site to be determined
differently from its original subject. Accordingly, if a redirection page contains
character strings as described above, the web site can be displayed on a search results
screen with the help of popular keywords even though it contains content unrelated to
the popular keywords.
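
One way to detect this case is sketched below with Python's standard `html.parser` module. The sketch treats a page as redirecting if it has a meta tag with `http-equiv="refresh"` or a script that mentions `self.location` or `location.replace`, and then reports any text found outside the script. These detection rules and the names `RedirectTextFinder` and `text_on_redirect_page` are assumptions for illustration, not the method fixed by the specification.

```python
import re
from html.parser import HTMLParser
from typing import List

# Script statements treated as redirection in this sketch.
SCRIPT_REDIRECT = re.compile(r"self\.location|location\.replace", re.IGNORECASE)


class RedirectTextFinder(HTMLParser):
    """Note whether a page redirects and collect text outside <script>."""

    def __init__(self) -> None:
        super().__init__()
        self.redirects = False
        self.in_script = False
        self.extra_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("http-equiv") or "").lower() == "refresh":
            self.redirects = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            if SCRIPT_REDIRECT.search(data):
                self.redirects = True
        elif data.strip():
            self.extra_text.append(data.strip())


def text_on_redirect_page(html: str) -> List[str]:
    """Return text found on a redirecting page, or [] if the page does not redirect."""
    finder = RedirectTextFinder()
    finder.feed(html)
    finder.close()
    return finder.extra_text if finder.redirects else []
```

A non-empty result marks a candidate only; a legitimate address-change notice also places text on a redirecting page, so a length or keyword threshold would normally be applied on top.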

(3) DECEPTIVE SITE USING STRING IN TITLE TAG 

Fig. 4c is a diagram showing an example deceptive site using character strings
contained in a title tag. In this figure, the left images are screenshots of web sites
displayed to users, and the right images are HTML source files of the web sites
displayed on the left side. As is well known to those skilled in the art, the title tag is
used to briefly display the subject of a web site on the top of a web browser, and it can be
embodied in source files as shown on the right side of Fig. 4c. In Fig. 4c, the upper
source file with a title tag, among the source files shown on the right side, includes a
plurality of popular keywords such as "Starcraft" and "Zolaman" in the title tag,
whereby a web browser is displayed as shown on the left side of Fig. 4c. On the other
hand, the lower source file of Fig. 4c uses a plurality of title tags, where a plurality of
popular keywords such as "Starcraft" and "Zolaman" are contained between their start
and end tags <title> and </title>.

The content in these title tags is not displayed on the web browser no matter
how long the character strings it contains are. However, the search robot provides
search results determined based on the frequency of occurrence of a specific character
string in a web site, which may cause the subject of the web site to be determined
differently from its original subject due to the character strings contained in the title tag.
Accordingly, as described above, if the length of a character string included in the title
tag is more than a predetermined numerical value, or if the number of title tags is more
than one, the web site can be displayed on a search results screen with the help of
popular keywords even though it contains content unrelated to the popular keywords.
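
The two conditions named above (a title longer than a predetermined value, or more than one title tag) can be checked as in the following sketch. The threshold of 60 characters and the names `TitleCollector` and `title_is_suspicious` are assumptions chosen for the example; the specification leaves the numerical value open.

```python
from html.parser import HTMLParser
from typing import List


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def title_is_suspicious(html: str, max_length: int = 60) -> bool:
    """True if there is more than one title tag or any title exceeds max_length."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    too_long = any(len(title.strip()) > max_length for title in collector.titles)
    return len(collector.titles) > 1 or too_long
```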

(4) DECEPTIVE SITE USING STRING CONTAINED IN META TAG 

Fig. 4d is a diagram showing an example deceptive site using character strings
contained in a meta tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed 
on the left side. 

As is well known to those skilled in the art, the meta tag is used to represent
general information about an HTML document, such as an author, date of creation and
keywords thereof, which is not displayed on the body of a web page corresponding to
the HTML document. Referring to the source file on the right side of Fig. 4d, the meta
tag contains "description" as the document name and a plurality of popular keywords
such as "Starcraft" and "Zolaman" as the document content. The character strings, such
as the popular keywords, contained in the meta tag are not displayed on the web page.
However, the search robot provides search results determined based on the frequency of
occurrence of a specific character string in a web site, which may cause the subject of
the web site to be determined differently from its original subject. Accordingly, if a
meta tag in a web site contains a character string, and the length of the character string
is more than a predetermined numerical value as described above, the web site can be
displayed on a search results screen with the help of popular keywords even though it
contains content unrelated to the popular keywords.
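
The length test on meta tag content can be sketched as follows. The sketch looks only at meta tags whose `name` is "description" or "keywords" and uses an assumed limit of 200 characters; the limit, the choice of names, and the identifiers `MetaContentCollector` and `long_meta_contents` are illustrative assumptions.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class MetaContentCollector(HTMLParser):
    """Collect (name, content) pairs from description and keywords meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.contents: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").strip()
        if name in ("description", "keywords") and content:
            self.contents.append((name, content))


def long_meta_contents(html: str, max_length: int = 200) -> List[Tuple[str, int]]:
    """Return (name, length) for each meta content longer than max_length."""
    collector = MetaContentCollector()
    collector.feed(html)
    collector.close()
    return [(name, len(content)) for name, content in collector.contents
            if len(content) > max_length]
```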

(5) DECEPTIVE SITE USING STRING LOCATED AT FRAME TAG 

Fig. 4e is a diagram showing an example deceptive site using character strings
located at a frame tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed
on the left side. As is well known to those skilled in the art, the frame tag is used to split
a screen, on which a web page is displayed, into two or more frames. Referring to the 
source file on the right side of Fig. 4e, a frame tag <FRAMESET ROWS=" "> is used to 
split the screen horizontally, where information of the split screen ratio is inserted in " ". 
Character strings located next to the end tag </FRAMESET> of the frame tag include a 
plurality of popular keywords such as "Starcraft" and "Zolaman". The character 
strings, such as the popular keywords, located next to the end frame tag have nothing to 
do with the splitting of the web page screen. However, the search robot provides 
search results determined based on the frequency of occurrence of a specific character 
string in a web site, which may cause the subject of the web site to be determined 
differently from its original subject. Accordingly, if a character string is located at a 
frame tag, and the length of the character string is more than a predetermined numerical 
value as described above, the web site can be displayed on a search results screen with 
the help of popular keywords even though it contains content unrelated to the popular 
keywords. 
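
A minimal sketch of this check collects whatever text follows the closing </FRAMESET> tag. The names `AfterFramesetCollector` and `text_after_frameset` are hypothetical, and comparing the returned length against a threshold is left to the caller, since the specification does not fix the predetermined value.

```python
from html.parser import HTMLParser
from typing import List


class AfterFramesetCollector(HTMLParser):
    """Collect text that appears after the outermost </frameset>."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.seen_frameset = False
        self.trailing: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "frameset":  # the parser lower-cases tag names
            self.depth += 1
            self.seen_frameset = True

    def handle_endtag(self, tag):
        if tag == "frameset" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.seen_frameset and self.depth == 0 and data.strip():
            self.trailing.append(data.strip())


def text_after_frameset(html: str) -> str:
    """Return the text located after the closing frameset tag, joined by spaces."""
    collector = AfterFramesetCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.trailing)
```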

(6) DECEPTIVE SITE USING STRING CONTAINED IN FORM TAG 
Fig. 4f is a diagram showing an example deceptive site using character strings 
contained in a form tag. In this figure, the left image is a screenshot of a web site 
displayed to users, and the right image is an HTML source file of the web site displayed 
on the left side. As well known to those skilled in the art, the form tag is used to 
define a desired form in a web page displayed with a web browser. Referring to the 
source file on the right side of Fig. 4f, the form tag may be composed as "<form> 
<input type - ! button type" value="displayed text ,! ></form>". The source file includes 
a button type "hidden" to set no text to be displayed on a corresponding button. 
Character strings shown in the source file, which are not displayed on the web page, 
include a plurality of popular keywords such as "Starcraft" and "Zolaman". The 
character strings, such as the popular keywords, contained in the form tag have nothing 
to do with the definition of a form in the web page. However, the search robot 
provides search results determined based on the frequency of occurrence of a specific
character string in a web site, which may cause the subject of the web site to be
determined differently from its original subject. Accordingly, if the length of a
character string included in a form tag is more than a predetermined numerical value as
described above, the web site can be displayed on a search results screen with the help
of popular keywords even though it contains content unrelated to the popular keywords.
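
As an illustration only, the form tag check described above might be sketched in Python as follows. The class name HiddenInputCollector, the helper functions and the threshold MAX_HIDDEN_LENGTH are hypothetical and do not appear in the specification; the sketch assumes that the "predetermined numerical value" is a simple character count and that hidden strings are carried in the value attribute of inputs whose type is "hidden".

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the value strings of <input type="hidden"> elements inside forms."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.hidden_values = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attr = dict(attrs)
            if (attr.get("type") or "").lower() == "hidden":
                self.hidden_values.append(attr.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


MAX_HIDDEN_LENGTH = 40  # hypothetical "predetermined numerical value"


def hidden_form_text_length(html):
    """Total number of characters stored in hidden form inputs."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return sum(len(value) for value in parser.hidden_values)


def exceeds_form_threshold(html, limit=MAX_HIDDEN_LENGTH):
    """True if the hidden form text is longer than the assumed threshold."""
    return hidden_form_text_length(html) > limit
```

A page whose hidden inputs carry a long run of popular keywords would exceed the assumed limit, while an ordinary form with short or visible inputs would not.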

(7) DECEPTIVE SITE USING STRING CONTAINED IN DIV TAG 

Fig. 4g is a diagram showing an example deceptive site using character strings 
contained in a div tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed
on the left side. As well known to those skilled in the art, the div tag is used with a
style sheet, using general ID and class attributes. In the source file on the right side of
Fig. 4g, the div tag is described as "<div style="display:none; ...">", where an attribute
"style" defining a style of character strings to be displayed on a web page is set as
"display:none", so that the character strings following the div tag are not displayed on
the web page. The character strings, such as popular keywords, contained in the div
tag have nothing to do with display of the web page on the screen. However, the
search robot provides search results determined based on the frequency of occurrence of
a specific character string in a web site, which may cause the subject of the web site to
be determined differently from its original subject. Accordingly, if the length of a 
character string included in a div tag is more than a predetermined numerical value as 
described above, the web site can be displayed on a search results screen with the help 
of popular keywords even though it contains content unrelated to the popular keywords. 
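
A minimal sketch of how the length of character strings hidden by a div tag could be measured is given below. It is an assumption-laden illustration, not the method of the specification: the names HiddenDivText and hidden_div_text_length are hypothetical, only inline "display:none" styles are considered, and style sheets defined elsewhere are ignored.

```python
from html.parser import HTMLParser


class HiddenDivText(HTMLParser):
    """Counts characters of text nested inside <div style="display:none"> blocks."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # div nesting depth inside a hidden div (0 = visible)
        self.hidden_chars = 0  # characters of text found while hidden

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        # A div nested inside a hidden div is hidden as well.
        if self.depth or "display:none" in style:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.hidden_chars += len(data.strip())


def hidden_div_text_length(html):
    """Length of the text that a browser would not display because of the div style."""
    parser = HiddenDivText()
    parser.feed(html)
    return parser.hidden_chars
```

The returned length can then be compared with whatever predetermined numerical value the operator of the search engine chooses.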

(8) DECEPTIVE SITE USING STRING CONTAINED IN A HREF TAG



Fig. 4h is a diagram showing an example deceptive site using character strings
contained in an "a href" tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed
on the left side. As well known to those skilled in the art, the "a href" tag is used to
link a specific word or image in a document to a location or address to move to, so as to
facilitate movement to a different location in the same document or to a different
document or web site. Referring to the source file on the right side of Fig. 4h, the
"a href" tag may be composed as "<a href="a location or address to move to"> a link
marking target </a>". Since neither a location to move to nor a link marking target is
assigned in the "a href" tag shown in Fig. 4h, the tag is not executed and the
content therein is not displayed on the web page. The character strings contained in the
non-executed "a href" tag include a plurality of popular keywords such as "Starcraft"
and "Zolaman". The character strings, such as the popular keywords, contained in the
"a href" tag have nothing to do with linking or with display of the web page on the screen.
However, the search robot provides search results determined based on the frequency of
occurrence of a specific character string in a web site, which may cause the subject of
the web site to be determined differently from its original subject. Accordingly, if the
length of a character string included in an "a href" tag is more than a predetermined
numerical value as described above, there is a risk that the web site may be displayed on
a search results screen with the help of popular keywords even though it contains
content unrelated to the popular keywords.

(9) DECEPTIVE SITE USING LINK FARM 

Fig. 4i is a diagram showing an example deceptive site using a link farm. As
well known to those skilled in the art, the link farm is mostly used to increase the search
engine ranking of a web page by generating a number of reciprocal links to the web site
and thus causing the search engine to continually search for the web site. The link
farm may be realized using the href tags described above.

It is problematic to directly determine that a web site using a link farm is
a deceptive site. However, if a web site uses a link farm that includes more than a
predetermined number of links in order to cause the search engine to
continually search for popular keywords in the web page, there is a need to detect the
web site because it is highly likely to be a deceptive site.

(10) DECEPTIVE SITE USING STRING CONTAINED IN FONT TAG
Fig. 4j is a diagram showing an example deceptive site using character strings
contained in a font tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed 
on the left side. 

As well known to those skilled in the art, the font tag is used to set the font size
of character strings. In the source file shown in Fig. 4j, a font size is set to "0" in a
font tag, so that character strings contained in the font tag are not displayed on the web
page. In the case where the character strings, which are not displayed on a web page
due to their font size of "0", include a plurality of popular keywords such as "Starcraft"
and "Zolaman", the character strings, such as the popular keywords, contained in the font
tag have nothing to do with the display of the web page on the screen. However, the
search robot provides search results determined based on the frequency of occurrence of
a specific character string in a web site, which may cause the subject of the web site to
be determined differently from its original subject. Accordingly, if a font tag in a web
site contains character strings whose font size is zero as described above, the web site
can be displayed on a search results screen with the help of popular keywords even
though it contains content unrelated to the popular keywords.

(11) DECEPTIVE SITE USING STRING CONTAINED IN IMAGE TAG

Fig. 4k is a diagram showing an example deceptive site using character strings
contained in an img tag. In this figure, the left image is a screenshot of a web site
displayed to users, and the right image is an HTML source file of the web site displayed
on the left side.

As well known to those skilled in the art, the img tag is used to insert a specific
image in a document. In the source file shown in Fig. 4k, "a.gif" is assigned as an
image file to be inserted. After assigning the image to be inserted, the img tag
generally specifies an attribute such as a location or an alignment method of the image.
In the case of Fig. 4k, such an attribute is specified using character strings. When the
image is displayed on the web browser, the attribute specified using the character
strings has no influence on the display of the image. In the case where the character
strings having no influence on the attribute of the image include a plurality of popular
keywords such as "Starcraft" and "Zolaman", the character strings, such as the popular
keywords, contained in the img tag have nothing to do with the display on the web
browser screen. However, the search robot provides search results determined based
on the frequency of occurrence of a specific character string in a web site, which may
cause the subject of the web site to be determined differently from its original subject.
Accordingly, if the length of a character string included in an img tag is more than a
predetermined numerical value as described above, the web site can be displayed on a
search results screen with the help of popular keywords even though it contains content
unrelated to the popular keywords.

At step 320 of Fig. 3, a tag or the like contained in an HTML document
corresponding to a web site is analyzed, and the length of character strings contained in
the tag or the like is measured as described in the above embodiments.
At step 325, it is determined, according to a predetermined basis applied to this
measurement result, whether the web site is a deceptive site.

Examples of the predetermined basis at step 325 to determine whether the web
site is a deceptive site are as described above with reference to Figs. 4a to 4k. For
example, the predetermined basis may be whether or not the HTML document
includes a character string of the same color as the background color of the web page, or
whether or not a redirection tag in the HTML document includes a character string.

According to a preferred embodiment of the present invention, a hybrid of the
analyses described above in the deceptive site types (1) to (11) is used as the
predetermined basis at step 325, and if the analysis value is more than a predetermined
value, it is determined that the web site is a deceptive site. For example, if the number
of title character strings contained in a title tag is more than one, 10 points may be
added to the analysis value for each string, and up to 70 points may be added thereto.
If a redirection page includes character strings, 70 points may be added to the analysis
value irrespective of the number of the character strings. For a link farm, 4 points per
50 links, up to 80 points, may be added to the analysis value. If there are character
strings whose font size is "0", 5 points per 100 bytes of the character strings, up to 70
points, may be added to the analysis value. A source file constituting a web page is
analyzed in this manner, and if a total analysis value of a web site, calculated using
points and weighted values obtained respectively based on the above various bases, is
more than 100 points, the web site may be determined to be a deceptive site. If the
deceptive site determination is based on only one basis (for example, a web site is
determined to be a deceptive site since the number of character strings contained in a
title tag of the web site is 50), the determination is highly likely to be erroneous. It is
thus preferable that the deceptive site determination be made based on a combination of
the various bases.
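
The point scheme given in this example can be sketched as follows. The sketch is illustrative only: the PageFeatures structure and the function names are hypothetical, partial blocks of links or bytes are assumed to earn no points (integer division), and the weighted values mentioned above are omitted because the specification does not state them.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Measurements assumed to have been taken at step 320 (hypothetical structure)."""
    title_strings: int = 0           # number of character strings in the title tag
    redirect_has_text: bool = False  # redirection page contains character strings
    link_count: int = 0              # number of links in a link farm
    zero_font_bytes: int = 0         # bytes of text whose font size is "0"


DECEPTIVE_THRESHOLD = 100  # total above which the site is treated as deceptive


def analysis_value(f):
    """Sum the example points, applying the per-basis caps from the description."""
    score = 0
    if f.title_strings > 1:
        score += min(10 * f.title_strings, 70)        # 10 per string, up to 70
    if f.redirect_has_text:
        score += 70                                   # flat 70 points
    score += min(4 * (f.link_count // 50), 80)        # 4 per 50 links, up to 80
    score += min(5 * (f.zero_font_bytes // 100), 70)  # 5 per 100 bytes, up to 70
    return score


def is_deceptive(f):
    return analysis_value(f) > DECEPTIVE_THRESHOLD
```

Under these assumptions, a page with 50 title strings and nothing else scores only the capped 70 points and is not flagged, which matches the remark that a single basis should not decide the outcome. A page with 3 title strings, text in its redirection page and 120 links scores 30 + 70 + 8 = 108 points and is flagged.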

According to a preferred embodiment of the present invention, different
predetermined bases for deceptive site determination may be applied to web sites
registered in a robot-based search engine and web sites registered in a directory-based
search engine. For example, if source file analysis of a web page corresponding to a
web site registered in the robot-based search engine shows that the web site belongs to
three of the 11 deceptive site types described above, the web site is determined to be a
deceptive site. On the other hand, a web site registered in the directory-based search
engine is determined to be a deceptive site even if it belongs to only one of the 11
deceptive site types. This is because most directory-based search engine providers
receive registration fees from the registrants of web sites and therefore need to return
the favor to those registrants, whereas robot-based search engine providers register web
sites without registration fees.
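
The engine-dependent basis in this example could be expressed as a small lookup, sketched below. The dictionary REQUIRED_TYPE_MATCHES, the engine labels "robot" and "directory", and the function name are hypothetical; the counts 3 and 1 are the ones given in the example above.

```python
# Number of distinct deceptive site types (out of the 11 described) that a web
# site must match before it is treated as deceptive, per kind of search engine.
REQUIRED_TYPE_MATCHES = {"robot": 3, "directory": 1}


def is_deceptive_for_engine(matched_types, engine_kind):
    """matched_types: iterable of type numbers (1 to 11) that the site matched."""
    return len(set(matched_types)) >= REQUIRED_TYPE_MATCHES[engine_kind]
```

For instance, a site matching types 5 and 7 would be flagged in a directory-based engine but not in a robot-based engine under these assumed settings.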

If the web site is determined to be a deceptive site at step 325, the registrant
field of the database described above is searched to obtain information of a registrant of
the web site (330). Contact information of the registrant is extracted from the
registrant information of the web site (335). A warning is given to the registrant of the
web site by sending an email or an SMS message to the registrant using the extracted
contact information (340). The warning will be described below in detail with
reference to Fig. 5.

According to another embodiment of the present invention, an image described
in a tag of a web site may be analyzed at step 320. For example, pixels of the image
are analyzed to extract RGB components of the pixels, and if the number of pixels of a
specific color (for example, yellow) among the extracted RGB components exceeds a
predetermined reference value (for example, if such pixels account for 50% or more),
the web site may be considered a site containing obscene content, based on which it
may be determined whether the web site is a deceptive site.
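
The pixel analysis of this embodiment might look like the sketch below, which works on a plain sequence of (R, G, B) tuples so that no image library is required. The function names and the numeric bounds used to decide that a pixel is "yellow" are assumptions made for illustration; the specification gives only the color name and the 50% example.

```python
def is_yellow(rgb):
    """Assumed test for a 'yellow' pixel: strong red and green, weak blue."""
    r, g, b = rgb
    return r >= 200 and g >= 200 and b <= 100


def color_ratio(pixels, matches=is_yellow):
    """Fraction of pixels for which the color test is true (0.0 for no pixels)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if matches(p)) / len(pixels)


def exceeds_color_reference(pixels, reference=0.5):
    """True if the specific color accounts for the reference share or more."""
    return color_ratio(pixels) >= reference
```

In practice the pixel tuples would be obtained by decoding the image file referenced in the tag; that decoding step is outside this sketch.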

Fig. 5 is a flow chart showing a method for imposing a predetermined punitive
measure on a registrant of a web site that is determined to be a deceptive site or an
altered site, in the method for managing the web sites registered in the search engine,
according to a preferred embodiment of the present invention.

With reference to Fig. 5, a description will now be given of how a punitive
measure is automatically taken against a web site when it is determined to be a
deceptive site at step 325 of Fig. 3. If the web site is determined to be a deceptive site,
a web site management module searches a web site information database to obtain
information of a registrant of the web site (510), and the web site management module
receives the registrant information (520 and 550). According to an embodiment of the
present invention, the web site management module extracts contact information of the
registrant, such as an email address or a mobile terminal number thereof, from the
received registrant information (530), and controls a mail server or an SMS server to
transmit a predetermined message to a location corresponding to the contact information
(540).

According to another embodiment of the present invention, the web site
management module extracts information of other registered web sites of the registrant
from the registrant information (560), and then performs a control operation to
automatically analyze the other web sites registered under the same registrant name
(570). This is because the other web sites registered under the same registrant name
are highly likely to be deceptive sites operated based on the same or similar method.
In this embodiment, if, based on the analysis of the other registered web sites, it is
determined that they are deceptive sites, step 510 of Fig. 5 may be repeated.
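
The flow of Fig. 5 described above, combining the warning of steps 510 to 540 with the analysis of the registrant's other sites in steps 560 and 570, can be sketched with in-memory dictionaries standing in for the web site information database. All names here (handle_deceptive_site, site_owner, registrants, send_warning) are hypothetical, and the two callables are placeholders for the analysis of Fig. 3 and for the mail or SMS server.

```python
def handle_deceptive_site(site, site_owner, registrants, is_deceptive, send_warning):
    """Warn the registrant of a deceptive site and follow up on the registrant's other sites.

    site_owner:   dict mapping a site address to a registrant name
    registrants:  dict mapping a registrant name to {"contact": ..., "sites": [...]}
    is_deceptive: callable(site) -> bool, standing in for steps 320 and 325
    send_warning: callable(contact, site), standing in for the mail or SMS server
    """
    pending, seen, warned = [site], set(), []
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        registrant = registrants[site_owner[current]]  # steps 510 and 520
        send_warning(registrant["contact"], current)   # steps 530 and 540
        warned.append(current)
        for other in registrant["sites"]:              # step 560
            if other not in seen and is_deceptive(other):  # step 570
                pending.append(other)                  # repeat from step 510
    return warned
```

The set of already visited sites keeps the loop from warning about the same site twice when several sites share one registrant.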

According to a preferred embodiment of the present invention, if a web site is
determined to be a deceptive site based on the analysis and determination methods, the
system for managing the registered web sites may operate to automatically send an
email, an SMS message or the like to a registrant of the web site to point out problems
of the web site and then request that the registrant of the web site correct the problems
within a grace period. In addition, the system may be set to automatically perform the
analysis and determination processes after the grace period. If the problems of the web
site have not been corrected even after the grace period, a punitive measure, such as
cancellation of the registration of the web site, may be taken against the registrant thereof.
According to another embodiment of the present invention, a punitive measure such as a
complicated registration procedure may be imposed on the registrant of the web site
when the registrant requests registration of another web site at a later time.
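
The grace period handling can be sketched as a small decision function. The 14-day period, the function name follow_up and the returned labels are assumptions for illustration; the specification does not state how long the grace period is.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=14)  # hypothetical length of the grace period


def follow_up(warned_on, today, reanalyze):
    """Decide what to do with a site whose registrant was warned on `warned_on`.

    reanalyze: callable() -> bool, returning True if the site is still found
    to be deceptive; it is only called once the grace period has elapsed.
    """
    if today < warned_on + GRACE_PERIOD:
        return "wait"
    if reanalyze():
        return "cancel_registration"
    return "no_action"
```

Passing the re-analysis as a callable keeps the analysis from running before the grace period has ended.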

Embodiments of the present invention further relate to computer readable
media that include program instructions for performing various computer-implemented
operations. The media may also include, alone or in combination with the program
instructions, data files, data structures, tables, and the like. The media and program
instructions may be those specially designed and constructed for the purposes of the
present invention, or they may be of the kind well known and available to those having
skill in the computer software arts. Examples of computer-readable media include
magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such
as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices
that are specially configured to store and perform program instructions, such as
read-only memory devices (ROM) and random access memory (RAM). The media may
also be a transmission medium such as optical or metallic lines, wave guides, etc.
including a carrier wave transmitting signals specifying the program instructions, data
structures, etc. Examples of program instructions include both machine code, such as
produced by a compiler, and files containing higher level code that may be executed by
the computer using an interpreter.

Fig. 6 is a block diagram showing the internal configuration of a general
computer system that can be used in managing web pages registered in the search
engine according to the present invention.

The computer system includes any number of processors 640 (also referred to
as central processing units, or CPUs) that are coupled to storage devices including
primary storage 660 (typically a random access memory, or "RAM") and primary storage
670 (typically a read only memory, or "ROM"). As is well known in the art, primary
storage 670 acts to transfer data and instructions uni-directionally to the CPU and
primary storage 660 is used typically to transfer data and instructions in a bi-directional
manner. Both of these primary storage devices may include any suitable type of the
computer-readable media described above. A mass storage device 610 is also coupled
bi-directionally to CPU 640 and provides additional data storage capacity and may
include any of the computer-readable media described above. The mass storage device
610 may be used to store programs, data and the like and is typically a secondary
storage medium such as a hard disk that is slower than primary storage. A specific
mass storage device such as a CD-ROM 620 may also pass data uni-directionally to the
CPU. Processor 640 is also coupled to an interface 630 that includes one or more
input/output devices such as video monitors, track balls, mice, keyboards,
microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape
readers, tablets, styluses, voice or handwriting recognizers, or other well-known input
devices such as, of course, other computers. Finally, processor 640 optionally may be
coupled to a computer or telecommunications network using a network connection as
shown generally at 650. With such a network connection, it is contemplated that the
CPU might receive information from the network, or might output information to the
network in the course of performing the above-described method steps. The
above-described devices and materials will be familiar to those of skill in the computer
hardware and software arts.

The hardware elements described above may be configured (usually
temporarily) to act as one or more software modules for performing the operations of
this invention.

Industrial Applicability 

According to the present invention, there is provided a method for managing web
sites registered in a search engine, in which an algorithm is used to automatically detect
deceptive sites, thereby allowing users of the search engine to correctly search for their
desired information.

In addition, there is provided a method for managing web sites registered in a search
engine, in which deceptive sites are automatically detected, and punitive measures are
automatically imposed on operators of the detected deceptive sites, thereby reinforcing
self-purification of the web sites registered in the search engine.

Furthermore, there is provided a method for managing web sites registered in a search
engine, in which an algorithm is used to automatically detect deceptive sites and
automatically take punitive measures such as warnings against the detected sites, thereby
saving a large amount of human resources that may otherwise have been wasted to detect
the deceptive sites.

The foregoing descriptions of specific embodiments of the present invention
have been presented for purposes of illustration and description. They are not intended
to be exhaustive or to limit the invention to the precise forms disclosed, and obviously
many modifications and variations are possible in light of the above teaching. The
embodiments were chosen and described in order to best explain the principles of the
invention and its practical application, to thereby enable others skilled in the art to best
utilize the invention and various embodiments with various modifications as are suited
to the particular use contemplated. It is intended that the scope of the invention be
defined by the claims appended hereto and their equivalents.



